class three
dorng)step())apply() functions can help, but still doesn’t use multiple processorsparallel package (thanks, Miranda!)We would like to find a way to make use of our whole computer, and make useful tasks like bootstrapping available to use, but without having to invest large amounts of time in learning new programming languages. Enter foreach, which keeps the structure of a for loop, but allows us to drop two key assumptions:
We will begin with a simple chunk of R code involving a for loop, and transform it. Along the way, we’ll take a look at the equivalent computation done with an apply() function, and see that using foreach and multiple processors outperforms this too.
We are going to look at data from the New York City bikeshare program (Citibike).
Goal: find a model which can offer good prediction. Start with a few plausible models and use K fold cross validation to decide which one to use.
<—! loess smoother —>
We now suppose that we’re considering three increasingly complex models of the arrival behavior. In order to compare the three models prediction error, we’ll use K fold cross validation, where we use the same folds for all three models.
Here’s some familiar code that sets things up:
source("~/git/fixschewitt/foreach/citibike_new_york/EDA/subsetting_arrivals_attributes.R")
get.errs <- function(test.set = NULL,
discarded = NULL) {
sml.glm <- glm(arrivals ~
bs(hour, degree = 4)
+ weekend
+ as.factor(id),
data = arrivals.sub[-c(discarded, test.set), ],
family = "poisson")
med.glm <- glm(arrivals ~
bs(hour, degree = 4)*weekend
+ as.factor(id),
data = arrivals.sub[-c(discarded, test.set), ],
family = "poisson")
big.glm <- glm(arrivals ~
bs(hour, degree = 4)*weekend
+ bs(hour, degree = 4)*as.factor(id),
data = arrivals.sub[-c(discarded, test.set), ],
family = "poisson")
sml.err <- mean((predict(object = sml.glm,
newdata = arrivals.sub[test.set, -7],
type = "response") -
arrivals[test.set, 7])^2)
med.err <- mean((predict(object = med.glm,
newdata = arrivals.sub[test.set, -7],
type = "response") -
arrivals[test.set, 7])^2)
big.err <- mean((predict(object = big.glm,
newdata = arrivals.sub[test.set, -7],
type = "response") -
arrivals[test.set, 7])^2)
return(sqrt(c(sml.err, med.err, big.err)))
}
Next, we make our K-fold test sets (and implicitly, our training sets):
K <- 50
N <- dim(arrivals.sub)[1]
## kill off 8 observations and make cv test sets
set.seed(1985)
discarded <- sample(1:N, size = 8)
cv.test.sets <- matrix(sample((1:N)[-discarded], size = N - 8), ncol = K)
Using a naive for loop, we could implement this as:
err.for <- NULL
system.time(
for (i in 1:K) {
err.for <- cbind(err.for, get.errs(test.set = cv.test.sets[, i],
discarded = discarded))
}
)
## user system elapsed
## 19.00 1.09 20.64
If you’re good with apply() functions you might upgrade to
## apply version
system.time(
err.apply <- sapply(X = 1:K,
FUN = function(i) {
get.errs(test.set = cv.test.sets[, i],
discarded - discarded)
}
)
)
## user system elapsed
## 17.513 0.945 18.796
Neither of the first two methods take advantage of multiple processors. While apply() functions avoid the inherently sluggish nature of for loops in R, they are still ignorant of the processor structure. We want to chop the job into halves, fourths, etc. and use the whole computer!
Here is the same computation written with a foreach() loop
## foreach version
library(foreach)
library(doParallel)
registerDoParallel(cl = 4)
system.time(
err.foreach <- foreach(i=1:K,
.inorder = FALSE,
.combine = "cbind") %dopar%
get.errs(test.set = cv.test.sets[, i],
discarded = discarded)
)
## user system elapsed
## 21.432 1.550 9.358
%do% and %dopar%.combine can take on the intuitive values c, cbind,.inorderSometimes the list or vector that we are iterating over (in the above case, the vector 1:N) can be a very large object. In our case, the vector is quite reasonable in size, but perhaps the object we were iterating over was a multi-dimensional object, with many values, and a high level of precision. In this case, we’d be storing a massive object, which could potentiall fill up our useable memory and slow things down. We could save memory by only keeping the piece of our list we need for a given calculation, and dumping the rest. This is the idea behind the iterators package in R.
library(iterators)
registerDoParallel(cl = 4)
system.time(
err.foreach.iter <- foreach(x = iter(cv.test.sets, by = "col"),
.inorder = FALSE,
.combine = "cbind") %dopar%
get.errs(test.set = x,
discarded = discarded)
)
## user system elapsed
## 28.859 2.119 10.455
random_
registerDoParallel(cl = __)